Crawling Web Pages with Support for Client-Side Dynamism

نویسندگان

Manuel Álvarez

Alberto Pan

Juan Raposo

Justo Hidalgo

چکیده

There is a great amount of information on the web that can not be accessed by conventional crawler engines. This portion of the web is usually known as the Hidden Web. To be able to deal with this problem, it is necessary to solve two tasks: crawling the client-side and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the information placed in web pages with support for client-side dynamism, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling mod-

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not an simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

متن کامل

Workload-Aware Web Crawling and Server Workload Detection

With the development of search engines, more and more web crawlers are used to gather web pages. The rising crawling traffic has brought the concern that crawlers may impact web sites. On the other hand, more efficient crawling strategy is required for the coverage and freshness of search engine index. In this paper, crawlers of several major search engines are analyzed using one six-months acc...

متن کامل

Introducing Dynamic Ranking on Web Pages Based on Multiple Ontology Supported Domains

Search Engine ensures efficient Web-page ranking and retrieving. Page ranking is typically used for displaying the Web-pages at client-side. We are going to introduce a data structural model for retrieval of the searched Webpages. We propose two algorithms in this paper. The first algorithm constructs the Index Based Acyclic Graph generated by multiple ontologies supported crawling and the seco...

متن کامل

Crawling JavaScript websites using WebKit – with application to analysis of hate speech in online discussions

JavaScript Client-side hidden web pages (CSHW) contain dynamic material created as a result of specific user activities. The number of CSHW websites is increasing. Crawling the so-called Hidden Web is challenging, particularly when JavaScript CSHW from an external website is seamlessly included as part of the web pages. We have developed a prototype web crawler that efficiently extracts content...

متن کامل

Crawling the client-side hidden web

There is a great amount of information on the web that can not be accessed by conventional crawler engines. This portion of the web is usually called hidden web data. To be able to deal with this problem, it is necessary to solve two tasks: crawling the client-side and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2006

Crawling Web Pages with Support for Client-Side Dynamism

نویسندگان

چکیده

منابع مشابه

Prioritize the ordering of URL queue in Focused crawler

Workload-Aware Web Crawling and Server Workload Detection

Introducing Dynamic Ranking on Web Pages Based on Multiple Ontology Supported Domains

Crawling JavaScript websites using WebKit – with application to analysis of hate speech in online discussions

Crawling the client-side hidden web

عنوان ژورنال:

اشتراک گذاری